The Penn Chinese TreeBank: Phrase structure annotation of a large corpus
Abstract
With growing interest in Chinese Language Processing, numerous NLP tools (e.g., word segmenters, part-of-speech taggers, and parsers) for Chinese have been developed all over the world. However, since no large-scale bracketed corpora are available to the public, these tools are trained on corpora with different segmentation criteria, part-of-speech tagsets and bracketing guidelines, and therefore, comparisons are difficult. As a first step towards addressing this issue, we have been preparing a large bracketed corpus since late 1998. The first two installments of the corpus, 250 thousand words of data, fully segmented, POS-tagged and syntactically bracketed, have been released to the public via LDC (www.ldc.upenn.edu). In this paper, we discuss several Chinese linguistic issues and their implications for our treebanking efforts and how we address these issues when developing our annotation guidelines. We also describe our engineering strategies to improve speed while ensuring annotation quality.
Related resources
C-structures and F-structures for the British National Corpus
We describe how the British National Corpus (BNC), a one hundred million word balanced corpus of British English, was parsed into Lexical Functional Grammar (LFG) c-structures and f-structures, using a treebank-based parsing architecture. The parsing architecture uses a state-of-the-art statistical parser and reranker trained on the Penn Treebank to produce context-free phrase structure trees, ...
Semi-automatically Developing Chinese HPSG Grammar from the Penn Chinese Treebank for Deep Parsing
In this paper, we introduce our recent work on Chinese HPSG grammar development through treebank conversion. By manually defining grammatical constraints and annotation rules, we convert the bracketing trees in the Penn Chinese Treebank (CTB) to be an HPSG treebank. Then, a large-scale lexicon is automatically extracted from the HPSG treebank. Experimental results on the CTB 6.0 show that a HPS...
Penn Korean Treebank: Development and Evaluation
With growing interest in Korean language processing, numerous natural language processing (NLP) tools for Korean, such as part-of-speech (POS) taggers, morphological analyzers, and parsers, have been developed. This progress was possible through the availability of large-scale raw text corpora and POS-tagged corpora (ETRI, 1999; Yoon and Choi, 1999a; Yoon and Choi, 1999b). However, no large-scale...
Deep Context-Free Grammar for Chinese with Broad-Coverage
The accuracy of Chinese parsers trained on the Penn Chinese Treebank is evidently lower than that of the English parsers trained on the Penn Treebank. It is plausible that the essential reason is the lack of surface syntactic constraints in Chinese. In this paper, we present evidence to show that strict deep syntactic constraints exist in Chinese sentences and such constraints cannot be effectively de...
Semi-automatic Annotation of Chinese Word Structure
Chinese word structure annotation is potentially useful for many NLP tasks, especially for Chinese word segmentation. Li and Zhou (2012) have presented an annotation for word structures in the Penn Chinese Treebank. But they only consider words that have productive affixes, which cover 35% of word types in that corpus. In this paper, we propose a linguistically inspired annotation that covers ...
Journal title:
- Natural Language Engineering
Volume 11, Issue
Pages -
Publication date 2005